Duke University
Canonical Regression Model & Bayesian Model Uncertainty
Estimation via MCMC Monte Carlo Frequencies
Probability Proportional to Size Sampling in Finite Populations
Adaptive Independent Metropolis/Adaptive Importance Sampling
Illustration
Observe response vector \(\mathbf{Y}\) with predictor variables \(X_1 \dots X_p\).
Model for data under a specific model \({\cal M}_\gamma\): \[\begin{equation*} \mathbf{Y}\mid \alpha, \boldsymbol{\beta_\gamma}, \phi, {\cal M}_\gamma\sim \textsf{N}(\mathbf{1}_n\alpha + \mathbf{X}_{\boldsymbol{\gamma}}\boldsymbol{\beta_\gamma}, \mathbf{I}_n/\phi) \label{eq:linear.model} \end{equation*}\]
Models \({\cal M}_\gamma\) encoded by \(\boldsymbol{\gamma}= (\gamma_1, \ldots \gamma_p)^T\) binary vector with \(\gamma_j = 1\) indicating that \(X_j\) is included in model \({\cal M}_\gamma\) where \[\begin{align*} \gamma_j = 0 & \Leftrightarrow \beta_j = 0 \\ \gamma_j = 1 & \Leftrightarrow \beta_j \neq 0 \end{align*}\]
\(\mathbf{X}_{\boldsymbol{\gamma}}\): the \(n \times p_{\boldsymbol{\gamma}}\) design matrix for model \({\cal M}_\gamma\)
\(\boldsymbol{\beta_\gamma}\): the \(p_{\boldsymbol{\gamma}}\) vector of non-zero regression coefficients under \({\cal M}_\gamma\)
intercept \(\alpha\), precision \(\phi\) common to all models
\(2^p\) models in the model space \(\boldsymbol{\Gamma}\)
prior distributions on all unknowns \(\boldsymbol{\theta}_{\boldsymbol{\gamma}}\equiv (\boldsymbol{\beta}_{{\cal M}_\gamma}, \alpha_{{\cal M}_\gamma}, \phi_{{\cal M}_\gamma})\) and \({\cal M}_\gamma\)
reference priors for \(\alpha_{{\cal M}_\gamma}\), \(\phi_{{\cal M}_\gamma}\): \(p(\alpha_{{\cal M}_\gamma}, \phi_{{\cal M}_\gamma} \mid {\cal M}_\gamma) \propto 1/\phi_{{\cal M}_\gamma}\)
“conventional” priors for \(\boldsymbol{\beta}_{{\cal M}_\gamma} \mid \phi, {\cal M}_\gamma\)
Mixtures of g-priors: \[\begin{align*} \boldsymbol{\beta}_{{\cal M}_\gamma} \mid g, \phi, {\cal M}_\gamma& \sim \textsf{N}\left(\mathbf{0}, g (\mathbf{X}_{\boldsymbol{\gamma}}^T \mathbf{X}_{\boldsymbol{\gamma}})^{-1}/\phi \right) \\ g \mid {\cal M}_\gamma& \sim \pi(g \mid {\cal M}_\gamma) \end{align*}\]
for mixtures of \(g\)-priors, we can integrate out the parameters \(\boldsymbol{\theta}_{\boldsymbol{\gamma}}= (\boldsymbol{\beta_\gamma}, \alpha_{{\cal M}_\gamma}, \phi_{{\cal M}_\gamma})\) given \(g\) in closed form \[p(\mathbf{Y}\mid {\cal M}_\gamma, g) = \int p(\mathbf{Y}\mid \boldsymbol{\theta}_{\boldsymbol{\gamma}}, {\cal M}_\gamma)p(\boldsymbol{\theta}_{\boldsymbol{\gamma}}\mid g, {\cal M}_\gamma) d\boldsymbol{\theta}_{\boldsymbol{\gamma}}\] \[p(\mathbf{Y}\mid {{\cal M}}_{\boldsymbol{\gamma}}, g) = C (1+g)^{(n-1-p_{\boldsymbol{\gamma}})/2} \big[ 1 + g(1 - R^2_{\boldsymbol{\gamma}}) \big]^{-(n-1)/2}\]
marginal likelihood of \({\cal M}_\gamma\) is a one-dimensional integral (Laplace approximation or numerical) proportional to \[p(\mathbf{Y}\mid {\cal M}_\gamma) = \int p(\mathbf{Y}\mid g, {\cal M}_\gamma)\pi(g \mid {\cal M}_\gamma) dg\]
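As a concrete check, here is a minimal R sketch (not the BAS internals) that evaluates \(p(\mathbf{Y}\mid {\cal M}_\gamma, g)\) up to the common constant \(C\) and integrates \(g\) out numerically; the hyper-\(g\) prior \(\pi(g) = \frac{a-2}{2}(1+g)^{-a/2}\) with \(a = 3\) is an illustrative assumption.

```r
## log p(Y | M, g) up to the constant C, as a function of g, n, p_gamma, R^2
log_marg_g <- function(g, n, p_gam, R2) {
  (n - 1 - p_gam) / 2 * log(1 + g) - (n - 1) / 2 * log(1 + g * (1 - R2))
}

## integrate g out numerically under a hyper-g prior (a = 3 assumed)
marg_lik <- function(n, p_gam, R2, a = 3) {
  integrand <- function(g) {
    exp(log_marg_g(g, n, p_gam, R2)) * (a - 2) / 2 * (1 + g)^(-a / 2)
  }
  integrate(integrand, lower = 0, upper = Inf)$value
}

marg_lik(n = 100, p_gam = 3, R2 = 0.4)
```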
marginal likelihoods for models under mixtures of \(g\)-priors expressed as functions of \(F\)-statistics for testing \(\boldsymbol{\beta_\gamma}= 0\), \(n\) and \(p_{\boldsymbol{\gamma}}\)
provides a calibration of p-values via posterior probabilities of models
\[p({\cal M}_\gamma\mid \mathbf{Y}) = \frac {p(\mathbf{Y}\mid {\cal M}_\gamma) p({\cal M}_\gamma)} {\sum_{\boldsymbol{\gamma}' \in \boldsymbol{\Gamma}} p(\mathbf{Y}\mid {\cal M}_{\gamma'})p({\cal M}_{\gamma'})}\]
Can’t enumerate all models!
Use a sample of models from \(\boldsymbol{\Gamma}\) to approximate the posterior distribution of models
design a Markov Chain to transition through \(\boldsymbol{\Gamma}\) with stationary distribution \(p({\cal M}_\gamma\mid \mathbf{Y})\) \[ p({\cal M}_\gamma\mid \mathbf{Y}) \propto p(\mathbf{Y}\mid {\cal M}_\gamma) p({\cal M}_\gamma) \]
propose a new model from \(q(\boldsymbol{\gamma}^* \mid \boldsymbol{\gamma})\)
accept moving to \(\boldsymbol{\gamma}^*\) with probability \[ \textsf{MH} = \min\left(1, \frac{p(\mathbf{Y}\mid {\cal M}_{\gamma^*}) p({\cal M}_{\gamma^*})/q(\boldsymbol{\gamma}^* \mid \boldsymbol{\gamma})} {p(\mathbf{Y}\mid {\cal M}_\gamma) p({\cal M}_\gamma)/q(\boldsymbol{\gamma}\mid \boldsymbol{\gamma}^*)}\right) \]
otherwise stay at model \({\cal M}_\gamma\)
models are sampled proportional to their posterior probabilities as \(T \to \infty\)
Estimate the probabilities of models via Monte Carlo frequencies of models or ergodic averages \[\begin{align*} \widehat{p({\cal M}_\gamma\mid \mathbf{Y})} & = \frac{\sum_{t = 1}^T I({{\cal M}}_t = {\cal M}_\gamma)} {T} \\ & = \frac{n_{\boldsymbol{\gamma}}}{\sum_{\boldsymbol{\gamma}' \in S} n_{\boldsymbol{\gamma}'}} \quad \text{for } {\cal M}_\gamma\in S \end{align*}\]
\(T\) = # MCMC samples
\(S\) is the collection of unique sampled models
\(n_{\boldsymbol{\gamma}}\) is the frequency of model \({\cal M}_\gamma\) in \(S\)
\(n = \sum_{\boldsymbol{\gamma}\in S} n_{\boldsymbol{\gamma}}\): the total number of sampled models, so \(n = T\)
asymptotically unbiased as \(T \to \infty\)
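A self-contained toy version of this sampler and estimator (an illustrative sketch, not the BAS implementation): a symmetric flip-one-coordinate proposal, so \(q\) cancels in the MH ratio, with \(-\textsf{BIC}/2\) standing in for the \(g\)-prior log marginal likelihood and a uniform prior over models.

```r
set.seed(1)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - X[, 3] + rnorm(n)

## log p(Y | M) + log p(M) up to a constant; BIC as a rough stand-in
log_post <- function(gam) {
  fit <- if (sum(gam) == 0) lm(y ~ 1) else lm(y ~ X[, gam == 1, drop = FALSE])
  -BIC(fit) / 2
}

T_iter <- 5000
gam <- rbinom(p, 1, 0.5)
lp  <- log_post(gam)
samples <- matrix(NA_integer_, T_iter, p)
for (t in 1:T_iter) {
  gam_star <- gam
  j <- sample.int(p, 1)
  gam_star[j] <- 1L - gam_star[j]       # flip one inclusion indicator
  lp_star <- log_post(gam_star)
  if (log(runif(1)) < lp_star - lp) {   # MH accept/reject step
    gam <- gam_star; lp <- lp_star
  }
  samples[t, ] <- gam
}

## Monte Carlo frequency estimates: n_gamma / T for each sampled model
keys <- apply(samples, 1, paste, collapse = "")
head(sort(table(keys) / T_iter, decreasing = TRUE))
```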
BMA/BVS has a reputation of being slow and computationally intensive!
Porwal and Raftery (PNAS 2022) compare a range of state-of-the-art and standard methods in high-dimensional problems
21 methods based on available R packages
BMA methods are competitive with the Lasso in computational speed, and better than the Lasso and other methods in prediction and model selection!
but Monte Carlo frequency estimates are fundamentally unsound to a Bayesian! (O’Hagan 1987, The Statistician)
in high-dimensional problems many models are rarely or never sampled (high variance but unbiased)
ignores observed information in the marginal likelihoods \(\times\) prior probabilities!
alternatives based on estimating the normalizing constant \[\hat{C} = \sum_{\boldsymbol{\gamma}\in {\cal{S}}} p(\mathbf{Y}\mid {\cal M}_\gamma) p({\cal M}_\gamma)\]
renormalized observed marginal likelihoods \(\times\) prior probabilities of sampled models in \({\cal{S}}\)
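A sketch of the renormalization over \({\cal{S}}\), computed on the log scale for numerical stability; `log_m` (my notation) holds \(\log p(\mathbf{Y}\mid {{\cal M}}_i) + \log p({{\cal M}}_i)\) for the unique sampled models.

```r
## renormalized posterior model probabilities over the sampled set S
renorm_prob <- function(log_m) {
  w <- exp(log_m - max(log_m))  # subtract max to guard against underflow
  w / sum(w)
}
```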
can we do better using ideas from Finite Population Sampling?
Can view Metropolis-Hastings sampling from \(\boldsymbol{\Gamma}\) (in the limit) as a form of Probability Proportional to Size (PPS) sampling with replacement from \(\boldsymbol{\Gamma}\)
Let \(q({{\cal M}}_i)\) be the probability of sampling \({{\cal M}}_i\)
Goal is to estimate \[C = \sum_{i=1}^N p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)\] where \(N = |\boldsymbol{\Gamma}|\)
Finite Population Sampling Estimators
Hansen-Hurwitz (1943) may be viewed as an importance sampling estimate \[\hat{C} = \frac{1}{n}\sum_{i \in S} \frac{ n_i\, p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)}{q({{\cal M}}_i)} \]
If we have “perfect” samples from the posterior then \(q({{\cal M}}_i) = \frac{p(\mathbf{Y}\mid {{\cal M}}_i)p({{\cal M}}_i)}{C}\) and recover \(C\)!
Since \(C\) is unknown, apply the ratio HH estimator (or self-normalized IS) \[ \hat{C} = \frac{\frac1 n \sum_{i \in S} \frac{n_i\, p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)}{q({{\cal M}}_i)}}{ \frac1 n \sum_{i \in S} \frac{n_i}{q({{\cal M}}_i)}} = \left[ \frac{1}{n} \sum_{i \in S} \frac{n_i}{p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)} \right]^{-1} \] where the equality holds when \(q({{\cal M}}_i) \propto p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)\)
This recovers the harmonic mean estimator of Newton & Raftery (1994) - while asymptotically unbiased, it is highly unstable!
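For comparison, a sketch of the harmonic mean estimate computed stably via log-sum-exp; again `log_m` is assumed to hold \(\log\{p(\mathbf{Y}\mid {{\cal M}}_t) p({{\cal M}}_t)\}\) for each of the posterior draws.

```r
## harmonic mean estimate of log C: log [ (1/T) * sum_t 1/m_t ]^{-1}
log_C_hm <- function(log_m) {
  a <- -log_m
  log(length(log_m)) - (max(a) + log(sum(exp(a - max(a)))))
}
```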
FPS estimators such as HH depend on MC frequencies; the Horvitz-Thompson (HT) estimator instead weights each unique sampled model by its inclusion probability
HT estimate of normalizing constant: \[\quad \hat{C} = \sum_{i \in S} \frac{p(\mathbf{Y}\mid {{\cal M}}_i)p({{\cal M}}_i)} {\pi_i}\]
\(\pi_i\) is the inclusion probability that \(\boldsymbol{\gamma}_i \in S\)
under sampling with replacement \(\pi_i = 1 - (1 - q({{\cal M}}_i))^T\)
Horvitz-Thompson dominates Hansen-Hurwitz (smaller variance) and is the unique hyper-admissible estimate of \(C\) (Joshi, 1972; Ramakrishnan, 1973)
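A sketch of the HT estimate under with-replacement sampling; `q` holds the draw probabilities \(q({{\cal M}}_i)\) of the unique sampled models, `log_m` their log marginal likelihood plus log prior, and `T_draws` the number of MCMC draws (names are mine).

```r
## Horvitz-Thompson estimate of C with pi_i = 1 - (1 - q_i)^T
C_HT <- function(log_m, q, T_draws) {
  pi_incl <- 1 - (1 - q)^T_draws   # inclusion probabilities under SWR
  sum(exp(log_m) / pi_incl)
}
```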
Basu’s (1971) famous circus example illustrated potential problems with the Horvitz-Thompson estimator (similar problem arises with IS)
violates the likelihood principle
once we have samples, \(p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)\) are fixed and the sampling probabilities are not relevant
the only randomness is over the remaining units that were not sampled (which is governed by the sampling design)
Basu’s estimate under SWOP (using \(\pi_i \propto A_i = q({{\cal M}}_i)\)), \[\hat{C} = \sum_{i \in S} p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i) + \frac{1}{n} \left( \sum_{i \in S} \frac{p(\mathbf{Y}\mid {{\cal M}}_i)p({{\cal M}}_i)}{\pi_i} \right) \times \left(\sum_{i \notin S} \pi_i \right)\]
conditions on the sum over observed models and estimates the remainder
Basu’s (1971) estimate of the total can be justified by a “super-population” model-based approach (Meeden and Ghosh, 1983)
Let \(m_i = p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)\) \[\begin{align*} m_i \mid \pi_i &\mathrel{\mathop{\sim}\limits^{\rm ind}}\textsf{N}(\pi_i \eta, \tau^2 \pi_i^2) \\ p(\eta, \tau^2) & \propto 1/\tau^2 \end{align*}\]
posterior mean of \(\eta\) is \(\hat{\eta} = \frac{1}{n} \sum_{i \in S} \frac{m_i}{\pi_i}\) (the HT estimate of the total, divided by \(n\))
using the posterior predictive for the unsampled \(m_i\), \(i \notin S\): \(\textsf{E}[m_i \mid \{m_j\}_{j \in S}] = \pi_i \hat{\eta}\) \[\begin{align*} C & = \sum_{i \in \boldsymbol{\Gamma}} m_i = \sum_{i \in S} m_i + \sum_{i \notin S} m_i \\ \hat{C} & = \sum_{i \in S} m_i + \sum_{i \notin S} \hat{\eta} \pi_i = \sum_{i \in S} m_i + \left[\frac{1}{n} \sum_{i \in S} \frac{m_i}{\pi_i} \right] \sum_{i \notin S} \pi_i \end{align*}\]
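A sketch of the resulting estimate for the \(n\) unique sampled models, using \(\sum_{i \in S}(1 - \pi_i)\) in place of \(\sum_{i \notin S} \pi_i\) (the two agree under a fixed sample size, as in the formulas that follow); vector names are my assumptions.

```r
## Basu's estimate: observed sum plus predicted total of unsampled models
C_basu <- function(log_m, pi_incl) {
  m <- exp(log_m)
  eta_hat <- mean(m / pi_incl)          # pseudo-posterior mean of eta
  sum(m) + eta_hat * sum(1 - pi_incl)   # sum_{i in S} m_i + eta_hat * sum(1 - pi_i)
}
```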
estimate of the posterior probability of \({\cal M}_\gamma\) for \({\cal M}_\gamma\in S\) \[ \frac{p(\mathbf{Y}\mid {\cal M}_\gamma) p({\cal M}_\gamma)}{\sum_{i \in S} p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i) + \frac{1}{n} \sum_{i \in S} \frac{p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)}{\pi_i}\sum_{i \in S} (1 - \pi_i)} \]
estimate of the total posterior probability of all models in \(\boldsymbol{\Gamma}- S\), from the predictive distribution \[ \frac{\frac{1}{n} \sum_{i \in S} \frac{p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)}{\pi_i}\sum_{i \in S} (1 - \pi_i)}{\sum_{i \in S} p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i) + \frac{1}{n} \sum_{i \in S} \frac{p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)}{\pi_i}\sum_{i \in S} (1 - \pi_i)} \]
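Continuing the sketch above, the corresponding posterior model probabilities, with the last entry collecting the predictive mass on \(\boldsymbol{\Gamma}- S\).

```r
## posterior probabilities for models in S plus the mass on unsampled models
p_hat_basu <- function(log_m, pi_incl) {
  m <- exp(log_m)
  leftover <- mean(m / pi_incl) * sum(1 - pi_incl)  # mass on Gamma \ S
  c(m, leftover) / (sum(m) + leftover)              # last entry: all of Gamma \ S
}
```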
Uses renormalized marginal likelihoods of sampled models
easy to compute marginal inclusion probabilities
Can also derive \(\textsf{E}[\boldsymbol{\beta}\mid \mathbf{Y}]\), \(\textsf{E}[\mathbf{X}\boldsymbol{\beta}\mid \mathbf{Y}]\), \(\textsf{E}[\mathbf{Y}^* \mid \mathbf{Y}]\) or \(p(\Delta \mid \mathbf{Y})\)
still have to come up with a good \(q({{\cal M}}_i)\)!
Griffin et al (2021):
Independent Metropolis-Hastings \[q(\boldsymbol{\gamma}) = \prod_{j=1}^p \pi_j^{\gamma_j} (1 - \pi_j)^{1-\gamma_j}\]
Product of Independent Bernoullis is optimal for targets where the \(\gamma_j\) are independent a posteriori
Use adaptive Independent Metropolis-Hastings to learn the \(\pi_j\) from past samples
uses Metropolis-Hastings Ratio to accept proposals + Monte Carlo frequencies to estimate posterior model probabilities
a poor approximation for standalone sampling with replacement / importance sampling as correlation in \(\mathbf{X}\) increases
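A sketch of this independent-Bernoulli proposal (function and argument names are mine); it returns \(\log q(\boldsymbol{\gamma}^*)\) for use in the MH ratio or as an importance weight.

```r
## product-of-independent-Bernoullis proposal with adapted probabilities pi
q_indep <- function(pi) {
  g <- rbinom(length(pi), 1, pi)                   # gamma*_j ~ Ber(pi_j)
  list(gamma = g,
       log_q = sum(dbinom(g, 1, pi, log = TRUE)))  # log q(gamma*)
}
```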
The joint posterior distribution of \(\boldsymbol{\gamma}\) (dropping \(\mathbf{Y}\)) may be factored: \[p({\cal M}_\gamma\mid \mathbf{Y}) \equiv p(\boldsymbol{\gamma}\mid \mathbf{Y}) = \prod_{j = 1}^p p(\gamma_j \mid \boldsymbol{\gamma}_{<j}) \] where \(\boldsymbol{\gamma}_{< j}\equiv \{\gamma_k\}\) for \(k < j\) and \(p(\gamma_1 \mid \boldsymbol{\gamma}_{<1}) \equiv p(\gamma_1)\).
As \(\gamma_j\) are binary, re-express as \[\begin{equation*} p(\boldsymbol{\gamma}\mid \mathbf{Y}) = \prod_{j=1}^p(\rho_{j \mid <j})^{\gamma_j}{(1-\rho_{j \mid <j})}^{1-\gamma_j} \end{equation*}\] where \(\rho_{j \mid <j} \equiv \Pr(\gamma_j = 1 \mid \boldsymbol{\gamma}_{<j})\) and \(\rho_{1 \mid < 1} = \rho_1\), the marginal probability.
Product of Dependent Bernoullis
Factor proposal \[q(\boldsymbol{\gamma}) = \prod_{j = 1}^p q(\gamma_j \mid \boldsymbol{\gamma}_{<j}) = \prod_j \textsf{Ber}(\hat{\rho}_{j \mid <j}) \]
Note: \(\Pr(\gamma_j = 1 \mid \boldsymbol{\gamma}_{<j}) = \textsf{E}[\gamma_j \mid \boldsymbol{\gamma}_{<j}]\)
Fit a sequence of \(p\) regressions of \(\gamma_j\) on \(\boldsymbol{\gamma}_{<j}\) \[\begin{align*} \gamma_1 & = \mu_1 + \epsilon_1 \\ \gamma_2 & = \mu_2 + \beta_{2 1} (\gamma_1 - \mu_1) + \epsilon_2 \\ \gamma_3 & = \mu_3 + \beta_{3 1} (\gamma_1 - \mu_1) + \beta_{3 2} (\gamma_2 - \mu_2) + \epsilon_3 \\ & \vdots \\ \gamma_p & = \mu_p + \beta_{p 1} (\gamma_1 - \mu_1) + \ldots + \beta_{p \, p-1} (\gamma_{p-1} - \mu_{p-1})+ \epsilon_p \end{align*}\]
Approximate model \[\boldsymbol{\gamma}\sim \textsf{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}_{\boldsymbol{\gamma}})\]
Wermuth (1980) compositional regression \[ \mathbf{G}= \mathbf{1}_{T} \boldsymbol{\mu}^T + (\mathbf{G}- \mathbf{1}_T \boldsymbol{\mu}^T) \mathbf{B}+ \boldsymbol{\epsilon} \]
\(\mathbf{G}\) is \(T \times p\) matrix where row \(t\) is \(\boldsymbol{\gamma}_t\)
\(\boldsymbol{\mu}\) is the \(p\) dimensional vector of \(\textsf{E}[\boldsymbol{\gamma}]\)
\(\boldsymbol{\Sigma}_{\boldsymbol{\gamma}} = \mathbf{U}^T \mathbf{U}\) where \(\mathbf{U}\) is upper triangular Cholesky decomposition of covariance matrix of \(\boldsymbol{\gamma}\) (\(p \times p\))
\(\mathbf{B}^T = \mathbf{I}_p - \operatorname{diag}(\mathbf{U})\, \mathbf{U}^{-T}\) (lower triangular), i.e., \(\mathbf{B}= \mathbf{I}_p - \mathbf{U}^{-1} \operatorname{diag}(\mathbf{U})\)
\(\mathbf{B}\) is a \(p \times p\) strictly upper triangular matrix (zeros on the diagonal) with the regression coefficients for the \(j\)th regression in column \(j\)
OLS is BLUE and consistent, but \(\mathbf{G}\) may not be full rank
apply Bayesian shrinkage with “priors” on \(\boldsymbol{\mu}\) (non-informative or normal) and \(\boldsymbol{\Sigma}_{\boldsymbol{\gamma}}\) (inverse-Wishart)
the pseudo-posterior mean of \(\boldsymbol{\mu}\) is the current estimate of the marginal inclusion probabilities: \(\hat{\boldsymbol{\mu}} = \bar{\boldsymbol{\gamma}}\)
use pseudo-posterior mean for \(\boldsymbol{\Sigma}\)
one Cholesky decomposition provides all coefficients for the \(p\) predictions for proposing \(\boldsymbol{\gamma}^*\)
constrain predicted values \(\hat{\rho}_{j \mid <j} \in (\delta, 1-\delta)\)
generate \(\boldsymbol{\gamma}^*_j \mid \boldsymbol{\gamma}^*_{< j} \sim \textsf{Ber}(\hat{\rho}_{j \mid <j})\)
use as proposal for Adaptive Independent Metropolis-Hastings or Importance Sampling (Accept all) -or- Sampling Without Replacement (future)
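Putting the steps above together, a sketch of the dependent-Bernoulli proposal built from past samples \(\mathbf{G}\); the ridge-style shrinkage of the sample covariance and the default \(\delta\) are my stand-ins for the normal / inverse-Wishart pseudo-posterior, not the BAS implementation.

```r
## propose gamma* from the product of dependent Bernoullis via one Cholesky
propose_gamma <- function(G, delta = 0.025, shrink = 1e-3) {
  p  <- ncol(G)
  mu <- colMeans(G)                                  # estimated inclusion probs
  Sigma <- (1 - shrink) * cov(G) + shrink * diag(p)  # shrink so chol() succeeds
  U <- chol(Sigma)                                   # Sigma = t(U) %*% U
  B <- diag(p) - solve(U) %*% diag(diag(U))          # strictly upper triangular
  g <- numeric(p); log_q <- 0
  for (j in 1:p) {
    rho <- mu[j] +
      if (j > 1) sum(B[1:(j - 1), j] * (g[1:(j - 1)] - mu[1:(j - 1)])) else 0
    rho <- min(max(rho, delta), 1 - delta)  # constrain to (delta, 1 - delta)
    g[j] <- rbinom(1, 1, rho)
    log_q <- log_q + dbinom(g[j], 1, rho, log = TRUE)
  }
  list(gamma = g, log_q = log_q)            # proposal and its log density
}
```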
tecator data (Griffin et al (2021))
a sample of \(p = 20\) variables
compare enumeration (exact posterior probabilities) to the sampling-based estimates
Want to avoid MCMC?
avoid infinite regret
more general models?
other mean/variance assumptions for the super-population model lead to other estimates for \(C\), \(p({\cal M}_\gamma\mid \mathbf{Y})\), etc
Code in development version of BAS on https://github.com/merliseclyde/BAS
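For orientation, released BAS already fits the models above; `bas.lm` with the `prior`, `modelprior`, and `method = "MCMC"` arguments exist in the released package, `df` is a hypothetical data frame, and the interface for the new estimators is not shown since it is still in development.

```r
# remotes::install_github("merliseclyde/BAS")  # development version
library(BAS)
fit <- bas.lm(y ~ ., data = df,         # df: hypothetical data frame with response y
              prior = "hyper-g",        # mixture of g-priors
              modelprior = uniform(),   # uniform prior over the model space
              method = "MCMC")          # sample models via MCMC
```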